Preface

For this problem set we will apply some of the approaches presented in ISLR for variable selection and model regularization to datasets that we have worked with previously. The goal will be to see whether more principled methods for model selection will allow us to better understand relative variable importance, variability of predictive performance of the models, etc.

In this preface we will use the algae dataset from the lectures to illustrate some of the concepts and approaches employed here. Just to do something different for a change, this time we will be modeling another outcome available in the algae dataset, AG2. The problems that you are asked to explore on your own will use the fundraising dataset from the previous problem sets, but will otherwise be similar to the examples shown here. The flow of the presentation in this preface also closely follows the outline of Labs 6.5 and 6.6 in ISLR, and you are encouraged to refer to them for additional examples and details.

algaeRaw <- read.table("coil.analysis.data.txt", header=FALSE, sep=",", row.names=NULL, na.strings="XXXXXXX")
colnames(algaeRaw) <- c("season","size","velocity",paste0("C",1:8),paste0("AG",1:7))
algaeRaw[1:3,]
##   season   size velocity   C1   C2    C3    C4      C5      C6      C7   C8 AG1
## 1 winter small_   medium 8.00  9.8 60.80 6.238 578.000 105.000 170.000 50.0 0.0
## 2 spring small_   medium 8.35  8.0 57.75 1.288 370.000 428.750 558.750  1.3 1.4
## 3 autumn small_   medium 8.10 11.4 40.02 5.330 346.667 125.667 187.057 15.6 3.3
##    AG2 AG3 AG4  AG5 AG6 AG7
## 1  0.0 0.0 0.0 34.2 8.3 0.0
## 2  7.6 4.8 1.9  6.7 0.0 2.1
## 3 53.6 1.9 0.0  0.0 0.0 9.7
# remove cases with undefined values and three outliers:
algae <- na.omit(algaeRaw)
algae <- algae[algae[,"C4"]<max(algae[,"C4"],na.rm=TRUE) &
               algae[,"C3"]<max(algae[,"C3"],na.rm=TRUE) &
               algae[,"AG4"]<max(algae[,"AG4"],na.rm=TRUE),]
# log-transform selected attributes:
for ( iCol in 1:8 ) {
  if ( iCol > 2 ) {
    algae[,paste0("C",iCol)] <- log(algae[,paste0("C",iCol)])
  }
  if ( iCol < 8 ) {
    algae[,paste0("AG",iCol)] <- log(1+algae[,paste0("AG",iCol)])
  }
}
# we'll use AG2 as an outcome here:
algaeAG2 <- algae[,!colnames(algae)%in%paste0("AG",c(1,3:7))]
pairs(algaeAG2[,-(1:3)])

Selecting the best variable subset on the entire dataset

Assuming that we have read and pre-processed the algae data (omitted observations with undefined values, log-transformed where necessary and removed egregious outliers), let’s use regsubsets from the leaps library to select optimal models with the number of terms ranging from one to all variables in the dataset. We will try each of the methods available for this function and collect the corresponding model metrics (notice that we override the default value of the nvmax argument; reflect on why we do this and why we use that specific value – remember that the goal here is to evaluate models up to and including those that include every predictor available):

summaryMetrics <- NULL
whichAll <- list()
# for each of the four methods we want to try:
for ( myMthd in c("exhaustive", "backward", "forward", "seqrep") ) {
  # (OK, here's your answer: 15 because three categorical attributes are 
  # represented by dummy variables) -- run regsubsets for the method myMthd:
  rsRes <- regsubsets(AG2~.,algaeAG2,method=myMthd,nvmax=15)
  summRes <- summary(rsRes) 
  whichAll[[myMthd]] <- summRes$which # get the variable choices for the current method
  # and also extract from the summary all the metrics we are interested in:
  for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
    summaryMetrics <- rbind(summaryMetrics,
                           data.frame(
                             method=myMthd,
                             metric=metricName,
                             nvars=1:length(summRes[[metricName]]),
                             value=summRes[[metricName]])
                           )
  }
}

#... and now plot everything:
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + 
  geom_path() + 
  geom_point() + 
  facet_wrap(~metric,scales="free") +   
  theme_bw() + # theme_bw() must come before theme(), or it would reset the legend position
  theme(legend.position="top")

We can see that (1) on a couple of occasions sequential replacement selects models noticeably inferior to those selected by the rest of the methods, and (2) the “backward” method comes up with models ever so slightly worse than those of the other methods for nvars=5:6. Otherwise, all methods generally yield very comparable model performance by every metric.

Let us check which variables were actually selected by these different methods for each model size. We can quickly assess the selections by directly plotting the “yes/no” variable usage information encoded in the which matrices (which we extracted from the model summaries and saved in the code above). The plots demonstrate that the first four variables are always the same and are selected in the same order as the model size increases, regardless of the method used: the variable that gets selected first is C8, followed by C1, then C3, and then the sizemedium dummy variable. For larger model sizes, the order in which different methods select additional variables beyond those four starts to vary:

old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
# for each method used:
for ( myMthd in names(whichAll) ) {
  # get the previously saved 'which' matrix and plot it using the 'image' command:
  image(1:nrow(whichAll[[myMthd]]), 
        1:ncol(whichAll[[myMthd]]),
        whichAll[[myMthd]],
        xlab="N(vars)", ylab="",
        xaxt="n", yaxt="n", # suppress axis plotting, we will draw axes manually in a moment
        breaks=c(-0.5,0.5,1.5),
        col=c("white","gray"),main=myMthd)
  axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
  # use variable names as axis tick labels:
  axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}

par(old.par)

Using training and test data to select the best subset

As we have already discussed, it is generally not a great idea to select variables on the whole dataset. Choosing “the right variables” should rather be considered an integral part of the model fitting process itself: which variables we choose is as much “fitting the data we have in hand” as finding the best model coefficients for any given set of variables. Thus, we will follow Lab 6.5.3 in ISLR and split our data approximately evenly into training and test, multiple times. Every time we will select the best subsets of variables (for different model sizes) on the training data only, and then evaluate the performance of those models on both the training and test sets. Since with different samples of training data we can, conceivably, be selecting different variables (which is the whole idea!), we will also need to record which variables have been selected each time.

In order to be able to use regsubsets output to make predictions, we follow ISLR and set up an appropriate predict function that can be applied directly to the objects returned by regsubsets (notice .regsubsets in its name – this is how, under the S3 OOP framework in R, methods are matched to corresponding classes; in the code that follows we will simply pass objects of class regsubsets to the “generic” predict function, and the latter will know to dispatch the call to the specialized implementation predict.regsubsets() by matching the class name):

# takes a `regsubsets` object (as returned by the function `regsubsets()`) and
# a model "id" - needed since regsubsets stores the best selection for *each*
# model size it was run for. Applies the selected best model to newdata
# and returns the predictions
predict.regsubsets <- function (object, newdata, id, ...) {
  form <- as.formula(object$call[[2]]) # get the formula used for model selection
  # convert data into a model matrix (takes care of dummy encoding as needed):
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id) # fitted coefficients of the best model of the given size
  xvars <- names(coefi) # which variables are actually used by that model
  mat[,xvars] %*% coefi # keep just those columns and compute the predictions
}

Now we are all set to repeatedly draw training sets, choose the best set of variables on them by each of the four methods available in regsubsets, calculate the test error on the remaining samples, etc. To summarize variable selection over multiple splits of the data into training and test, we will use a 3-dimensional array whichSum – its first two dimensions match those of the which matrix (model sizes by predictor columns, i.e. 15 by 16 here, as discussed above), and the third dimension corresponds to the four methods available in regsubsets.

(If you find it difficult to think in terms of 3D arrays, you can change the code to make whichSum a list of four 15x16 matrices, each corresponding to a specific variable selection method.)
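
For instance, a sketch of that alternative bookkeeping (same counts, different container):

```r
# a named list of counting matrices instead of a 3D array, one per method:
whichSumList <- list()
for ( myMthd in c("exhaustive","backward","forward","seqrep") ) {
  whichSumList[[myMthd]] <- matrix(0, nrow=15, ncol=16,
        dimnames=list(NULL, colnames(model.matrix(AG2~.,algaeAG2))))
}
# inside the resampling loop one would then accumulate counts as:
# whichSumList[[jSelect]] <- whichSumList[[jSelect]] + summary(rsTrain)$which
```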

In order to split the data into training and test we will again use the sample() function – those who are curious and paying attention may want to reflect on the difference between how it is done below and how it is implemented in Ch. 6.5.3 of ISLR, and think about the consequences. (Hint: consider how the sizes of the training and test datasets vary from one iteration to another in the two implementations.)
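
To make that hint concrete, here is a quick comparison of the two splitting schemes on a toy sample of 10 observations:

```r
set.seed(1) # for reproducibility of this toy example only
# the approach used below: permute an exactly balanced vector of labels,
# so training and test are always (nearly) equal halves:
table(sample(rep(c(TRUE,FALSE), length.out=10)))
# the ISLR lab samples each observation's label independently instead,
# so the training set size itself fluctuates from iteration to iteration:
table(sample(c(TRUE,FALSE), 10, replace=TRUE))
```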

dfTmp <- NULL
whichSum <- array(0,dim=c(15,16,4),
  dimnames=list(NULL,colnames(model.matrix(AG2~.,algaeAG2)),
     c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(algaeAG2)))
  # Try each method available in regsubsets
  # to select the best model of each size, using training data only:
  for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
    rsTrain <- regsubsets(AG2~.,algaeAG2[bTrain,],nvmax=15,method=jSelect)
    # Add up variable selections: each cell of the resulting matrix will literally
    # count *how many times* each particular variable was selected for each 
    # particular model size across all the random resamplings of the data:
    whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
    # Calculate test error for each set of variables (each model size)
    # using predict.regsubsets implemented above:
    for ( kVarSet in 1:15 ) {
      # make predictions:
      testPred <- predict(rsTrain,algaeAG2[!bTrain,],id=kVarSet)
      # calculate MSE:
      mseTest <- mean((testPred-algaeAG2[!bTrain,"AG2"])^2)
      # add to data.frame for future plotting:
      dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
      mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
    }
  }
}
# plot MSEs by training/test, number of 
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)+theme_bw()

We can see a clear difference in the behavior of training and test error as the number of attributes in the model increases. The training error gradually decreases (although for larger numbers of predictors included in the model, the difference between median errors is small compared to the variability of the error across multiple splits of the data into training and test). The test error shows a clear decrease upon addition of the second attribute to the model, followed by a steady increase as three or more attributes are included. Therefore we can conclude that models with three or more attributes are very likely to overfit in this case.

Below we plot the fraction of times each variable was included in the best model of every given size, by each of the four methods (darker shades of gray indicate fractions closer to 1). We can see from the plots that the selection of C1 and C8 as the first two best variables is quite consistent (although the order may differ, depending on the training data (!); and on some rare occasions – i.e. the very light shades of gray – even other variables may be selected as the best ones for the \(n=1\) or \(n=2\)-variable model). Past model sizes of 3-4, it looks like pretty much any variable can be picked as the next-best one, with comparable frequencies, depending on the particular train-test split. This is consistent with the observation that models beyond that size are overfitting: they are indeed too eager to pick whichever variable happens to fit the noise better in any particular data split.

old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in dimnames(whichSum)[[3]] ) {
  tmpWhich <- whichSum[,,myMthd] / nTries
  image(1:nrow(tmpWhich),1:ncol(tmpWhich),tmpWhich,
        xlab="N(vars)",ylab="",xaxt="n",yaxt="n",main=myMthd,
        breaks=c(-0.1,0.1,0.25,0.5,0.75,0.9,1.1),
        col=c("white","gray90","gray75","gray50","gray25","gray10"))
  axis(1,1:nrow(tmpWhich),rownames(tmpWhich))
  axis(2,1:ncol(tmpWhich),colnames(tmpWhich),las=2)
}

par(old.par)

Similar observations can be made using cross-validation rather than a split of the dataset into training and test; this is omitted here for brevity.
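
For reference, a minimal sketch of how such a cross-validation could be set up (the fold count kFolds is our arbitrary choice here; this relies on the predict.regsubsets() function defined above):

```r
kFolds <- 5
# assign each observation to one of the folds:
foldId <- sample(rep(1:kFolds, length.out=nrow(algaeAG2)))
cvMSE <- matrix(NA, nrow=kFolds, ncol=15)
for ( k in 1:kFolds ) {
  # select best subsets on everything except fold k:
  rsCv <- regsubsets(AG2~., algaeAG2[foldId!=k,], nvmax=15)
  # evaluate each model size on the held-out fold:
  for ( nV in 1:15 ) {
    pred <- predict(rsCv, algaeAG2[foldId==k,], id=nV)
    cvMSE[k,nV] <- mean((pred - algaeAG2[foldId==k,"AG2"])^2)
  }
}
colMeans(cvMSE) # average cross-validation MSE for each model size
```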

Ridge for variable selection

As explained in the lecture and ISLR Ch.6.6, lasso and ridge regression can be performed by the glmnet function from the glmnet library – its argument alpha governs the form of the shrinkage penalty, so that alpha=0 corresponds to ridge and alpha=1 to lasso regression. The arguments to glmnet differ from those used by lm, for example, and require the matrix of predictors and the outcome to be specified separately. model.matrix is particularly helpful for constructing the matrix of predictors, as it creates dummy variables for categorical predictors:

# -1 to get rid of intercept that glmnet knows to include:
x <- model.matrix(AG2~.,algaeAG2)[,-1]
head(algaeAG2)
##   season   size velocity   C1   C2       C3        C4       C5       C6
## 1 winter small_   medium 8.00  9.8 4.107590 1.8306596 6.359574 4.653960
## 2 spring small_   medium 8.35  8.0 4.056123 0.2530906 5.913503 6.060874
## 3 autumn small_   medium 8.10 11.4 3.689379 1.6733512 5.848365 4.833636
## 4 spring small_   medium 8.07  4.8 4.348522 0.8337783 4.586823 4.113853
## 5 autumn small_   medium 8.06  9.0 4.013677 2.3433431 5.454038 4.064263
## 6 winter small_   high__ 8.25 13.1 4.185860 2.2244073 6.063785 2.904165
##         C7        C8      AG2
## 1 5.135798 3.9120230 0.000000
## 2 6.325702 0.2623643 2.151762
## 3 5.231413 2.7472709 4.000034
## 4 4.932313 0.3364722 3.737670
## 5 4.580673 2.3513753 1.360977
## 6 4.037192 3.3463891 2.747271
# notice how it created dummy variables for categorical attributes
head(x)
##   seasonspring seasonsummer seasonwinter sizemedium sizesmall_ velocitylow___
## 1            0            0            1          0          1              0
## 2            1            0            0          0          1              0
## 3            0            0            0          0          1              0
## 4            1            0            0          0          1              0
## 5            0            0            0          0          1              0
## 6            0            0            1          0          1              0
##   velocitymedium   C1   C2       C3        C4       C5       C6       C7
## 1              1 8.00  9.8 4.107590 1.8306596 6.359574 4.653960 5.135798
## 2              1 8.35  8.0 4.056123 0.2530906 5.913503 6.060874 6.325702
## 3              1 8.10 11.4 3.689379 1.6733512 5.848365 4.833636 5.231413
## 4              1 8.07  4.8 4.348522 0.8337783 4.586823 4.113853 4.932313
## 5              1 8.06  9.0 4.013677 2.3433431 5.454038 4.064263 4.580673
## 6              0 8.25 13.1 4.185860 2.2244073 6.063785 2.904165 4.037192
##          C8
## 1 3.9120230
## 2 0.2623643
## 3 2.7472709
## 4 0.3364722
## 5 2.3513753
## 6 3.3463891
y <- algaeAG2[,"AG2"]
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)

Plotting the output of glmnet illustrates how the contribution of each of the predictors changes as the amount of shrinkage changes. In ridge regression each predictor contributes more or less over the entire range of shrinkage levels.

The output of cv.glmnet shows averages and variabilities of MSE in cross-validation across different levels of regularization. The lambda.min field indicates the value of \(\lambda\) at which the lowest average MSE is achieved; lambda.1se indicates the larger \(\lambda\) (more regularization) whose MSE is one cross-validation SD higher than the minimum – this is an often-recommended \(\lambda\) to use, under the idea that it will be less susceptible to overfitting. You may find it instructive to experiment with levels of lambda other than those used by default to understand the sensitivity of cv.glmnet output to them. predict, depending on the value of its type argument, provides access to model predictions, coefficients, etc. at a given level of lambda:

cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)

cvRidgeRes$lambda.min
## [1] 0.8415975
cvRidgeRes$lambda.1se
## [1] 3.7288
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                           s1
## (Intercept)    -3.4201892833
## seasonspring    0.0037729438
## seasonsummer   -0.0060035842
## seasonwinter   -0.0975006229
## sizemedium     -0.1480074660
## sizesmall_     -0.0593286200
## velocitylow___  0.1251026812
## velocitymedium  0.1729267672
## C1              0.4956087322
## C2             -0.0005789763
## C3              0.1063279517
## C4             -0.0087399612
## C5             -0.0163725209
## C6              0.0739619299
## C7              0.0308752689
## C8              0.1288509916
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                          s1
## (Intercept)    -1.134278752
## seasonspring    0.008454967
## seasonsummer   -0.012326674
## seasonwinter   -0.035164346
## sizemedium     -0.048241805
## sizesmall_     -0.061382041
## velocitylow___  0.091905902
## velocitymedium  0.086372127
## C1              0.242471637
## C2             -0.006082201
## C3              0.060972178
## C4              0.001838859
## C5              0.003395528
## C6              0.047061060
## C7              0.039129964
## C8              0.068218969
# and with lambdas other than the default:
cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-80:80)/20))
plot(cvRidgeRes)

Similarly to what was observed for the variable selection methods above, the plot of cross-validation error for ridge regression has a well-defined minimum, indicating that some amount of regularization is necessary for the model using all attributes to prevent overfitting. Notice that the minimum \(MSE\simeq 1.25\) from ridge regression here is very comparable to the minimum observed above for average test error when variables were selected by regsubsets.

The relatively higher contributions of C1 and C8 to the model outcome are more apparent in the results of ridge regression performed on a centered and, more importantly, scaled matrix of predictors:

ridgeResScaled <- glmnet(scale(x),y,alpha=0)
plot(ridgeResScaled)

cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0)
plot(cvRidgeResScaled)

predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                          s1
## (Intercept)     1.453178693
## seasonspring    0.003691324
## seasonsummer   -0.005260707
## seasonwinter   -0.016378543
## sizemedium     -0.024054178
## sizesmall_     -0.028723168
## velocitylow___  0.034721169
## velocitymedium  0.042665971
## C1              0.114287215
## C2             -0.014465684
## C3              0.068388149
## C4              0.001705746
## C5              0.004647950
## C6              0.059777100
## C7              0.044104199
## C8              0.097283518

Lasso for variable selection

Lasso regression is done by the same call to glmnet, except that now alpha=1. One can see how more coefficients become zero as the amount of shrinkage increases. Notice that the amount of regularization increases from right to left when plotting the output of glmnet, and from left to right when plotting the output of cv.glmnet.

lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)

cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)

# With other than default levels of lambda:
cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-120:0)/20))
plot(cvLassoRes)

The cross-validation MSE for lasso regularization also shows a well-defined minimum.

predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                         s1
## (Intercept)    -2.58075850
## seasonspring    .         
## seasonsummer    .         
## seasonwinter    .         
## sizemedium      .         
## sizesmall_      .         
## velocitylow___  .         
## velocitymedium  .         
## C1              0.45361507
## C2              .         
## C3              0.04081512
## C4              .         
## C5              .         
## C6              .         
## C7              .         
## C8              0.13480989
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                         s1
## (Intercept)    -4.89634959
## seasonspring    .         
## seasonsummer    .         
## seasonwinter    .         
## sizemedium     -0.02207742
## sizesmall_      .         
## velocitylow___  .         
## velocitymedium  0.07118428
## C1              0.68743275
## C2              .         
## C3              0.09713636
## C4              .         
## C5              .         
## C6              0.04660259
## C7              .         
## C8              0.16402891

As explained above and illustrated in the plots of the cv.glmnet output, lambda.1se typically corresponds to more shrinkage, with more coefficients set to zero by lasso. Use of the scaled predictor matrix makes the contributions of C1 and C8, and to a smaller degree C3, more apparent:

lassoResScaled <- glmnet(scale(x),y,alpha=1)
plot(lassoResScaled)

cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
plot(cvLassoResScaled)

predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)    1.45317869
## seasonspring   .         
## seasonsummer   .         
## seasonwinter   .         
## sizemedium     .         
## sizesmall_     .         
## velocitylow___ .         
## velocitymedium .         
## C1             0.20081746
## C2             .         
## C3             0.03376726
## C4             .         
## C5             .         
## C6             .         
## C7             .         
## C8             0.18722644

Lasso on train/test datasets

Lastly, we can run lasso on several training datasets and calculate the corresponding test MSE and the frequency of inclusion of each of the coefficients in the model:

lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
  cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1)
  lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1)
  lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
  lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
  lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
  lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 1.346731
lassoCoefCnt
##   seasonspring   seasonsummer   seasonwinter     sizemedium     sizesmall_ 
##              0              0              0              0              0 
## velocitylow___ velocitymedium             C1             C2             C3 
##              1              1             28              0              7 
##             C4             C5             C6             C7             C8 
##              0              0              9              0             25

One can conclude that a typical lasso model includes two, sometimes three, coefficients, and (by comparison with some of the plots above) that its test MSE is about what was observed for the two- to three-variable models chosen by the best subset selection approach.
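
That conclusion about the typical model size can be checked more directly by recording the number of non-zero coefficients per resampling iteration, a sketch reusing the same loop structure as above:

```r
lassoNvar <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE), length.out=nrow(x)))
  cvTmp <- cv.glmnet(x[bTrain,], y[bTrain], alpha=1)
  coefTmp <- predict(glmnet(x[bTrain,], y[bTrain], alpha=1),
                     type="coefficients", s=cvTmp$lambda.1se)
  # count non-zero coefficients (excluding the intercept):
  lassoNvar <- c(lassoNvar, sum(coefTmp[-1,1]!=0))
}
table(lassoNvar) # distribution of model sizes across the splits
```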

Problem 1: the best subset selection (15 points)

We will use the same fundraising dataset from the week 4 problem set (properly preprocessed: shifted/log-transformed, with the original predictions that were provided alongside the data excluded).

It is up to you whether you want to include gender attribute in your analysis. It is a categorical attribute and as such it has to be correctly represented by dummy variable(s).

If you do it properly (and the preface supplies examples of doing that), you will get three extra points for each problem in this week’s assignment that (correctly!) includes the gender variable in the analysis, for a possible extra total of 3x4=12 points.

If you prefer not to add this extra work, then excluding gender from the data at the outset (as you were instructed to do for week 4) is probably the cleanest way to prevent it from getting in the way of your computations; in that case the total points you can earn for this assignment reverts to the standard 60 points.

#process data
fundata = read.csv("fund-raising-with-predictions.csv")
fundata$gender = as.numeric(fundata$gender == "F") #female = 1; male = 0
data = subset(fundata, select = c(-predcontr))
data[,-13] = log(data[,-13] + 1)
#subset selection
summetrics = NULL
whichAll = list()

for (method in c("exhaustive", "backward", "forward", "seqrep")){
  rsresults = regsubsets(contrib ~ ., data = data, method = method, nvmax = 12)
  sumresults = summary(rsresults)
  whichAll[[method]] = sumresults$which
  
  for (metric in c("rsq", "rss", "adjr2", "cp", "bic"))
  {
    summetrics = rbind(summetrics, data.frame(
                                      method = method,
                                      metric = metric,
                                      nvars = 1:length(sumresults[[metric]]),
                                      value = sumresults[[metric]])
                                    )
  }
}
#plot metrics
ggplot(summetrics, aes(x = nvars, y = value, shape = method, color = method)) +
  geom_path() +
  geom_point() +
  facet_wrap(~metric, scales = "free") +
  theme_bw() + # theme_bw() must come before theme(), or it would reset the legend position
  theme(legend.position = "top") +
  scale_x_continuous(breaks = seq(0, 12, by = 2))

#which variables
par(mfrow = c(2,2), ps = 14, mar = c(5, 7, 2, 1))

for (method in names(whichAll) ) {
  image(1:nrow(whichAll[[method]]), 
        1:ncol(whichAll[[method]]),
        whichAll[[method]],
        xlab="N(vars)", 
        ylab="",
        xaxt="n", 
        yaxt="n", 
        breaks=c(-0.5,0.5,1.5),
        col=c("white","pink"),
        main=method)
  axis(1,1:nrow(whichAll[[method]]),rownames(whichAll[[method]]))
  axis(2,1:ncol(whichAll[[method]]),colnames(whichAll[[method]]), las = 2)
}

The plots show that the methods are generally in agreement and that 7 variables appear to be optimal, with the highest adjusted R2 and R2 and the lowest BIC, Cp, and RSS. At 4 variables, the backward method yields a slightly worse model than the other methods. At 7 and 9 variables, sequential replacement selects inferior models compared to the other methods.

All of the methods include avecontr first, followed by lastcontr and maxdate. After that, all of the methods include mincontrib, with the exception of the backward method, which chose mindate. mindate is the fifth variable for all of the other methods, whereas the fifth variable for the backward method is maxcontrib. For 6 and 7 variables, the forward method selected ncontrib and maxcontrib, respectively, while the backward method selected ncontrib and mincontrib. Seqrep chose both maxcontrib and ncontrib at 6 variables, and both promocontr and gapmos at 7, which were ranked lower by the other methods. gender and age were included 11th and 12th by all of the methods.

Problem 2: the best subset on training/test data (15 points)

#split testing / training
predict.regsubsets = function (object, newdata, id, ...){
  form = as.formula(object$call[[2]])
  mat = model.matrix(form,newdata)  
  coefi = coef(object, id=id)
  xvars=names(coefi)
  mat[,xvars] %*% coefi
}
dfTmp = NULL
whichSum = array(0,dim=c(12,13,4),
  dimnames = list(NULL,colnames(model.matrix(contrib ~.,data)),
     c("exhaustive", "backward", "forward", "seqrep")))

# Split data into training and test 30 times:
nTries = 30
for (iTry in 1:nTries ) {
  bTrain = sample(rep(c(TRUE,FALSE),length.out=nrow(data)))
  for (jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
    rsTrain = regsubsets(contrib ~ ., data[bTrain,], nvmax=12, method=jSelect)
    whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
    
    #calculate test error for each set of variables (each model size)
    for (kVarSet in 1:12 ) {
      testPred = predict(rsTrain, data[!bTrain,],id=kVarSet)
      mseTest = mean((testPred-data[!bTrain,"contrib"])^2)
      dfTmp = rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
      mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
    }
  }
}

ggplot(dfTmp,aes(x=factor(vars),y=mse,color=sel)) + geom_boxplot()+facet_wrap(~trainTest)+theme_bw()

predcontrmse = mean((data$contrib - log(fundata$predcontr + 1)) ^ 2)
paste("predcontr MSE:", round(predcontrmse, 2))
## [1] "predcontr MSE: 0.17"

With 7 variables, MSE appears to be the lowest for both the test and training data, with a median below 0.12 and slightly lower training error. The error remains stable as more variables are added, suggesting that the additional variables do not add much value. The different methods are largely in agreement with each other; however, seqrep shows higher MSE at some model sizes compared to the other methods. This error is lower than the MSE of the predcontr predictions, which was approximately 0.17.

Problem 3: lasso regression (15 points)

#lasso regression model
x = model.matrix(contrib ~ ., data)[,-1]
y = data[,"contrib"]

lass = glmnet(x, y, alpha = 1)
cvlass = cv.glmnet(x, y, alpha = 1)

par(mfrow = c(1,2))
plot(lass); plot(cvlass)

cvlass$lambda.min; cvlass$lambda.1se
## [1] 0.001625733
## [1] 0.03502535
predict(lass, type = "coefficients", s = cvlass$lambda.min)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -4.7338310744
## gapmos       0.0023251184
## promocontr  -0.0051650684
## mincontrib  -0.0506616350
## ncontrib    -0.0522508602
## maxcontrib   0.1228826059
## lastcontr    0.3405551248
## avecontr     0.5033110372
## mailord     -0.0030292650
## mindate     -1.0866780143
## maxdate      1.6459934869
## age          .           
## gender      -0.0001368809
predict(lass, type = "coefficients", s = cvlass$lambda.1se)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -4.1837284392
## gapmos       .           
## promocontr   .           
## mincontrib   .           
## ncontrib    -0.0005258808
## maxcontrib   0.0980294430
## lastcontr    0.3600084275
## avecontr     0.4300831123
## mailord      .           
## mindate      .           
## maxdate      0.4990649734
## age          .           
## gender       .
cvLassRes = cv.glmnet(x,y,alpha=1,lambda=10^((-80:80)/20))
plot(cvLassRes)
cvLassRes2 = cv.glmnet(x, y, alpha = 1, lambda = 10^((-120:0)/20))
plot(cvLassRes2)

The model coefficients at lambda.1se are smaller than some of the coefficients at lambda.min: for example, avecontr is 0.43 at 1SE, whereas it is 0.50 at the minimum. However, this trend is not consistent across all of the variables. Fewer coefficients are included with 1SE than with min: 1SE retains ncontrib, maxcontrib, lastcontr, avecontr, and maxdate, while min retains all of these plus gapmos, promocontr, mincontrib, mailord, mindate, and gender (everything except age). Values of log(lambda) below approximately -4 or -5 resulted in the low plateau of MSE, at the level shown above; log(lambda) above approximately 0 resulted in a high MSE plateau around 0.33.

Problem 4: lasso in resampling (15 points)

lassoCoefCnt = 0
lassoMSE = NULL

for (iTry in 1:30 ) {
  bTrain = sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
  cvLassoTrain = cv.glmnet(x[bTrain,],y[bTrain],alpha=1)
  lassoTrain = glmnet(x[bTrain,],y[bTrain],alpha=1)
  lassoTrainCoef = predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
  lassoCoefCnt = lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
  lassoTestPred = predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
  lassoMSE = c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 0.1277493
lassoCoefCnt
##     gapmos promocontr mincontrib   ncontrib maxcontrib  lastcontr   avecontr 
##          0          0          0          0         28         30         30 
##    mailord    mindate    maxdate        age     gender 
##          0          0          6          0          0

Through resampling, the typical model includes 3 variables (lastcontr and avecontr in every split, maxcontrib in 28 of the 30), with maxdate occasionally added as a fourth; this contrasts with the best subset selection above, which showed the lowest error with 7 variables. The mean MSE is approximately 0.128, which is slightly higher than the minimum MSE seen with the best subset selection above.